Abstract

A family of syntactic complexity metrics which contains a
number of current metrics is defined. The family is used as a
basis for experimental analysis of metrics. Once the family has
been implemented, several metrics may be readily formed and
computed. This paper uses the family to compare a few simple
syntactic metrics to each other. The study also indicates that
individual differences have a large impact on the significance of
results where many individuals are used. A metric for determining
the relative skills of programmers at handling a given level
of complexity is also suggested. The study uses the metrics to
demonstrate differences between projects on which a methodology
was used versus those on which it was not.

1. Introduction


As computer scientists attempt to understand the software
process and product, it is natural to try to measure those
aspects of software which seem to affect cost. A major problem
in computer science is the intellectual control of design, which
is directly related to the complexity of the product. Many
attempts at quantifying the complexity of computer programs have
been made [Basili & Reiter 79; Chen; Curtis et al; Dunsmore &
Gannon 80; Halstead; McCabe; Sunohara et al]. A good complexity
metric could be used as a quality assurance test by software
developers and even as a contractual obligation. Current
complexity metrics may be roughly divided into two basic groups: (1)
static metrics, which are measures of the product at one particular
point in time, and (2) history metrics, which are measures of
the product and process taken over time. This paper will deal
with static complexity metrics, based upon the physical
attributes of a software product. These fall into three basic
categories: volume, control organization, and data organization.
Each of these categories will be discussed briefly below.

Volume metrics are measures of the size of a product; for
example, the number of lines of code, the number of statements,
or the number of operators and operands [Halstead]. Of course,
the software science volume metric is in this group. Even
cyclomatic complexity [McCabe] can be placed in this category
since it is the number of decisions plus one. The number of
procedures, the average length of procedures, and the number of
variables are examples of volume metrics. The number of
input/output formats [Carriere & Thibodeau] and other abstraction
metrics are volume metrics as well. Note that these are measures
of the logical size, rather than just the physical size, of a
program.

Control organization metrics are measures of the
comprehensibility of control structures. Thus cyclomatic complexity, when
viewed as the number of control paths, is also a control metric.
Knots [Woodward et al] and Maximal Intersection Number [Chen]
attempt to measure the control complexity by visual properties of
program control, either as it is written (in a computer language)
or as it would appear in a planar flow graph. Average nesting
level has been shown to be a useful control organization metric
[Dunsmore]. Essential complexity [McCabe], which will be
discussed later, falls in this category as well.


Data organization metrics are measures of data visibility
and use as well as the interactions between data within a
program. Data binding [Basili & Turner 76; Stevens, Myers &
Constantine] is an example of a module interaction metric. A span
[Elshoff] is an attempt to measure the proximity of references to
each data item. As such it qualifies as a data organization
metric. Slicing [Weiser] can also be considered as a data
organization metric. A slice is that (not necessarily consecutive)
portion of code which is necessary in order to produce some
partial output from the program.

2. Definition and Analysis of a Structural Metric Family

None of the above metrics has gained full acceptance
as a valid measure of program complexity, for at least two
reasons. First, there is a lack of experimental evidence to
determine what aspects of the system life cycle the metric actually
explains. While a metric could correlate well with debugging
time, it might still be a poor predictor of the effort required
to do maintenance. We need experimental evidence that is focused
on the expected uses of the metrics. Second, existing metrics
are static (non-parameterized) so they cannot be tuned to the
results of exploratory analysis.


A complete development of the structural family of
complexity metrics may be found in [Basili & Hutchens]. The structural
family includes many metrics from the volume and control
organization groups. The data organization group is a subject for
future research.

If  the  family  is  to  include  many  of  the  metrics  in  the 
literature  it  must  incorporate  length,  nesting  level,  control 
paths,  types  of  control  structures,  and  decomposition  simplicity. 
The  family  should  transcend  languages  (although  specific  members 
may  not).  The  various  members  may  relate  to  many  aspects  of 
software  development  and  maintenance  although  any  one  metric  may 
only  be  useful  in  a  limited  way. 

Length can be measured by lines of code, with or without
comments. However, in a free format language this measure can be
altered by cosmetic revisions of the code, so the number of
statements seems to be a more consistent measure. Nesting level
might be included explicitly or as a factor to be multiplied with
the complexity of the lower levels. Control paths and types of
control structures are closely related and are handled in a
variety of ways by current metrics, so the family must allow a
general mechanism for these concepts. Decomposition simplicity
is intended to measure the naturalness with which the intended
function is broken into smaller functions.


With these concepts in mind, a recursive definition of a
family of control structure complexity metrics (c) could be given
by:

$$c(p) = b \sum_{i=1}^{k} c(p_i) + f(n, \mathit{lev}, t, s)$$

where p is a program which is decomposed in some fashion into k
components p1, p2, ..., pk. The parameter b is used to generate
the multiplier for nesting level. The function f, the key to the
metric, has four arguments: n, the number of decisions in program
p which are not part of a particular subcomponent; lev, the
nesting level of component p; t, the type of structure instantiated
by p; and s, the structural "niceness" of p.

Some discussion of, and restrictions on, the parameters will
clarify their meaning. b is intended to penalize nesting, so b ≥ 1,
where, of course, b = 1 just removes it from the formula. Since
an increase in the number of decisions should not decrease the
complexity, f should be a nondecreasing function of n. At first
glance, one might be tempted to place a nondecreasing condition
on f with respect to the level, lev. However, there is reason to
believe that a concave up function (one which falls at first and
later rises) of lev may be better [Dunsmore]. An example may be
found in [Basili & Hutchens].

It should be noted that b is in fact superfluous, for the
metric

$$c(p) = b \sum_{i=1}^{k} c(p_i) + f(n, \mathit{lev}, t, s) = \sum_{i=1}^{k} c(p_i) + f'(n, \mathit{lev}, t, s)$$

where $f'(n, \mathit{lev}, t, s) = b^{\mathit{lev}} f(n, \mathit{lev}, t, s)$.

In  this  example,  b  is  reduced  to  a  constant  in  the  function  f'. 
The  use  of  the  constant,  b,  makes  the  penalties  more  explicit 
than  does  hiding  that  information  in  the  function.  Indeed,  many 
instantiations  may  use  b  instead  of  lev. 
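
The recursion, and the observation that b can be folded into the function, can be checked with a short sketch. The Python below is illustrative only: the `Component` class, the toy weight function `f`, the helper `absorb`, and the choice of numbering the outermost component as level 0 are all assumptions made for this example rather than part of the definition.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Component:
    """A program component p (hypothetical representation)."""
    n: int                      # decisions in p outside any subcomponent
    t: str                      # type of structure, e.g. "while" or "if"
    s: str = "proper"           # structural "niceness"
    parts: List["Component"] = field(default_factory=list)


Weight = Callable[[int, int, str, str], float]


def complexity(p: Component, b: float, f: Weight, lev: int = 0) -> float:
    """c(p) = b * sum of c(p_i) + f(n, lev, t, s), with lev = 0 outermost."""
    inner = sum(complexity(q, b, f, lev + 1) for q in p.parts)
    return b * inner + f(p.n, lev, p.t, p.s)


def f(n: int, lev: int, t: str, s: str) -> float:
    """Toy weight (an assumption): one plus the number of decisions."""
    return 1 + n


def absorb(b: float, f: Weight) -> Weight:
    """Build f'(n, lev, t, s) = b**lev * f(n, lev, t, s)."""
    return lambda n, lev, t, s: (b ** lev) * f(n, lev, t, s)


# A sequence containing a while loop that contains an if statement.
program = Component(0, "sequence", parts=[
    Component(1, "while", parts=[Component(1, "if")]),
])

with_b = complexity(program, 2, f)                 # explicit multiplier b = 2
without_b = complexity(program, 1, absorb(2, f))   # b folded into f'
print(with_b, without_b)
```

Both calls give the same value for this small tree, which is the sense in which b is superfluous; keeping it explicit simply makes the nesting penalty easier to read.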

The values of t normally range over syntactic entities, such
as while, case, and if statements. The parameter s is used to
determine if the control structure is "nice." More specifically,
it normally has two values, "structured construct" and
"non-structured construct."


The control flow of a program may be described by a digraph.
A program (equating the program and its digraph) is called a
proper program if it has a single entry and a single exit, and
every node of the program lies on some path from the entry to the
exit. A proper program is called a prime program if it contains
no proper subprograms with two or more nodes. The usual while do
od and if then else fi are examples of common prime programs. A
prime decomposition is found by continually replacing prime
subprograms by function nodes (a node with a single entry and a
single exit). A proper program has a unique prime decomposition if
successive sequences are treated as a unit [Linger, Mills &
Witt].
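
The path condition in the definition of a proper program is a reachability test on the digraph. The sketch below is one way to check it, under the assumption that the caller already names a single entry node and a single exit node; the adjacency-dictionary representation and the function names are hypothetical.

```python
from typing import Dict, List, Set


def reachable(graph: Dict[str, List[str]], start: str) -> Set[str]:
    """Return every node reachable from start by following edges."""
    seen = {start}
    stack = [start]
    while stack:
        for nxt in graph.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_proper(graph: Dict[str, List[str]], entry: str, exit_: str) -> bool:
    """True if every node lies on some path from entry to exit_."""
    nodes = set(graph) | {m for succs in graph.values() for m in succs}
    # Reverse every edge so that "can reach exit_" becomes a forward search.
    reverse: Dict[str, List[str]] = {n: [] for n in nodes}
    for n, succs in graph.items():
        for m in succs:
            reverse[m].append(n)
    return reachable(graph, entry) == nodes and reachable(reverse, exit_) == nodes


# Flow graph of an if-then-else (node names are illustrative).
if_then_else = {
    "entry": ["test"],
    "test": ["then", "else"],
    "then": ["exit"],
    "else": ["exit"],
    "exit": [],
}
# The same graph with a node that can never be reached from the entry.
with_dead_code = dict(if_then_else, orphan=["exit"])

print(is_proper(if_then_else, "entry", "exit"))
print(is_proper(with_dead_code, "entry", "exit"))
```

A node lies on some entry-to-exit path exactly when it is reachable from the entry and the exit is reachable from it, so two searches suffice.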

By letting the parameter s have the two values 1) proper and
2) not proper, the resulting (sub)family is given by:

$$c(p) = b \sum_{i=1}^{k} c(p_i) + \begin{cases} f(n, \mathit{lev}, t) & p \text{ proper} \\ g(n, \mathit{lev}, t) & p \text{ not proper} \end{cases}$$

This restricted family will be the subject of the rest of this
paper. If one assumes that proper programs are less complex than
non-proper programs then f(n,lev,t) ≤ g(n,lev,t) for all n, lev,
and t. The restricted family might reasonably be called a
syntactic complexity family since it is based on the syntactic
decompositions of the program.

3. Some Members of the Family

The decomposition of p into p1, p2, ..., pk can be based on
the syntactic structure of the language. One major benefit of
this approach is the ease with which a compiler can be changed
into an automatic metric tool. As a simple example, consider the
decomposition of programs into statements (and statements into
substatements) where

$$c(p) = \sum_{i=1}^{k} c(p_i) + \begin{cases} 1 & p \text{ a statement} \\ 0 & \text{otherwise.} \end{cases}$$

Note  that  this  uses  the  t  parameter  of  the  family.  The  resultant 
measure  is  nothing  more  than  a  statement  count  volume  metric. 

Cyclomatic complexity may be generated by counting the
decision constructs in the program plus the number of segments
[McCabe]. The measure is just

$$c(p) = \sum_{i=1}^{k} c(p_i) + \begin{cases} 1 & p \text{ a decision or segment} \\ 0 & \text{otherwise} \end{cases}$$

and eventually each decision will be counted exactly once.
Therefore, the member is just the cyclomatic complexity.
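
Both of these members can be written as one recursive counter that differs only in which kinds of component contribute 1. The sketch below is an illustration under assumptions: the tuple representation, the kind labels, and the simplification that each `if` or `while` holds exactly one decision are all hypothetical.

```python
from typing import List, Set, Tuple

# A component is (kind, subcomponents). The kinds are hypothetical labels.
Node = Tuple[str, List["Node"]]

DECISIONS = {"if", "while"}
STATEMENTS = {"if", "while", "assign", "call"}


def member(p: Node, counted: Set[str]) -> int:
    """c(p) = sum of c(p_i) + (1 if p's kind is counted, else 0)."""
    kind, parts = p
    return sum(member(q, counted) for q in parts) + (1 if kind in counted else 0)


# One segment: an assignment, then a while loop holding an if and a call.
segment: Node = ("segment", [
    ("assign", []),
    ("while", [
        ("if", [("assign", [])]),
        ("call", []),
    ]),
])

statement_count = member(segment, STATEMENTS)           # statement count member
cyclomatic = member(segment, DECISIONS | {"segment"})   # decisions plus segments
print(statement_count, cyclomatic)
```

For this segment the counter sees five statements, and two decisions plus one segment. A construct with several decisions, such as a case statement, would need the n argument of the family rather than a flat 1.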

The last example of the members of the family follows:

$$c(p) = 1.1 \sum_{i=1}^{k} c(p_i) + \begin{cases} 1 + \log_2(n+1) & p \text{ proper} \\ 2\,(1 + \log_2(n+1)) & p \text{ not proper} \end{cases} \tag{1}$$

This member exhibits some of the flexibility of the family. The
b value of 1.1 penalizes nesting by counting each statement 10%
more than it would be at the next outer level. Furthermore,
poorly structured code costs twice as much as well structured
code. Each statement must contribute at least one to the measure
due to the addition of 1 in each of the functions f and g. The
use of the logarithm encourages the use of case statements, the
only standard control structure with more than one decision node.
Thus, this metric includes consideration of nesting level, length
(statement count), structured programming practices, and bonuses
for use of an organizing construct (the case statement).

Several  other  members  of  the  family,  including  essential 
complexity  and  the  software  science  count  of  total  operators  and 
operands,  are  derived  in  [Basili  &  Hutchens]. 
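
A worked evaluation of equation (1) may help make the nesting penalty and the logarithmic bonus concrete. This is a minimal sketch, assuming a hypothetical `Stmt` record in which each statement carries its own decision count n and a flag saying whether it is proper; how a real compiler would derive those values is not shown.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Stmt:
    """A statement (hypothetical representation)."""
    n: int                       # decisions in this statement itself
    proper: bool = True          # False for a statement that is not proper
    parts: List["Stmt"] = field(default_factory=list)


def sc(p: Stmt, b: float = 1.1) -> float:
    """Equation (1): b * sum of c(p_i), plus f if proper or g = 2 * f if not."""
    base = 1 + math.log2(p.n + 1)
    local = base if p.proper else 2 * base
    return b * sum(sc(q, b) for q in p.parts) + local


# A proper while loop holding a simple statement and a non-proper if
# statement, which in turn holds one simple statement.
loop = Stmt(n=1, parts=[
    Stmt(n=0),
    Stmt(n=1, proper=False, parts=[Stmt(n=0)]),
])

loop_sc = sc(loop)                       # 1.1 * (1 + 5.1) + 2, about 8.71
case_sc = sc(Stmt(n=3))                  # one case with three decisions
three_ifs = sum(sc(Stmt(n=1)) for _ in range(3))
print(loop_sc, case_sc, three_ifs)
```

The last two values illustrate the bonus mentioned in the text: a single construct with three decisions scores 3, while three separate single-decision statements score 6 before any nesting penalty.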

4. Experimentation

This research focuses on the ability of product metrics to
explain the number of program changes made during development as
well as the differences in the metrics caused by different
development strategies. Program changes were defined by
[Dunsmore & Gannon 77] and shown to be closely related to the
number of errors made during development. The product metrics
used are from the syntactic complexity family. The syntactic
complexity family with proper versus not proper statement
distinctions has been implemented in the SIMPL-T compiler [Basili &
Turner 75]. SIMPL-T is a GOTO-less non-block structured language
which allows statement nesting. Loops may be abnormally exited
using the EXIT statement and RETURNs are allowed at any point.
SIMPL-T is used in many courses at the University of Maryland.

The research reported in this paper uses a database of 19
compilers written by upperclassmen and graduate students at the
University of Maryland. The compilers were written under three
different development methodologies: ad hoc individuals (AI), ad
hoc teams (AT), and disciplined teams (DT). The ad hoc
individuals and ad hoc teams were not given any particular methodologies
or techniques to be used in the implementation. They were free
to organize the project in any way they desired. The disciplined
teams were required to use a list of methodologies and techniques
which were taught in their class. These methodologies included
chief programmer teams, walkthroughs, and top down design with
PDL, among others. Several metrics have already been tested to
see if they detect the differences among the groups [Basili &
Reiter 79, 81].


The results reported here deal with the metric defined in
equation (1) (referred to as SC in the rest of this paper) as
well as statement count (ST), call count (CA, the number of
calls including procedures and functions whether user defined or
predefined in the language), cyclomatic complexity (CV), and the
number of decision statements (DS). Appendix 1 contains the
coefficients of determination (r square) and the slope of the
lines for each of the projects and each of the metrics using
simple regression analysis [Neter & Wasserman].
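
For readers who want to reproduce the two quantities tabulated per project, the slope and r square of a simple regression follow directly from the usual sums of squares. The sketch below uses ordinary least squares; the function name and the four data points are hypothetical and are not taken from the study.

```python
from typing import Sequence, Tuple


def simple_regression(xs: Sequence[float],
                      ys: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares fit y = slope * x + intercept; also returns r square."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_square = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_square


# Hypothetical data: metric value and program changes for four segments.
metric = [2, 4, 6, 8]
changes = [1, 2, 3, 5]
slope, intercept, r_square = simple_regression(metric, changes)
print(slope, intercept, r_square)
```

Here the slope has units of program changes per unit of the metric, which is the reading given to it later in the paper.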

The five metrics considered here are highly correlated, as
may be seen in the correlation matrix in Table 1. For this
reason, multiple regression equations tended to be erratic, with the
coefficients changing greatly with the addition of new variables,
while producing minimal increases in R square. Of the 19
projects, the addition of a second variable had one of three basic
results. Four of the projects found no second variable to be of
any value at all. Five of the projects found a second variable
that was moderately useful. The ten remaining projects added a
second variable whose coefficient was negative, while doubling or
tripling the coefficient of the original variable, indicating
that the two variables were highly related and making the
regression model questionable. Therefore, the rest of this paper deals
only with simple regression equations.

              Correlation Matrix
            for the product metrics

          ST      SC      CA      CV      DS

    ST   1.000
    SC    .975   1.000
    CA    .845    .770   1.000
    CV    .879    .893    .747   1.000
    DS    .873    .939    .617    .832   1.000

                   Table 1

4.1. The Effects of Individuals

The first attempt at comparing the metrics with the number
of program changes was done between projects. That is, the
number of program changes for each project was determined and the
metrics were computed on each segment (procedure or function) in
each project. The complexity of the project was then computed by
several methods. These include summing the complexity of each
segment as well as summing only the most complex segments (such
as the top 10 or 20 percent). All of these attempts at
correlating the number of program changes to the complexity of the
project were unsuccessful.

Graph 1 gives an example of the scatterplot of a metric (in
this case SC) with the number of program changes, where both
values are summed over each project. The relationship between
the metric and the number of program changes is almost
nonexistent.

However, within a single project, the complexity of each
individual segment was correlated with the number of program changes
made in that particular segment. Only the 6 projects that were
developed by ad hoc individuals were used in this part of the
study. The coefficient of determination (r square) for SC as a
predictor of program changes ranged between .475 and .866 for
these 6 projects. The other metrics had slightly lower values
but a similar spread (see Appendix 1). Therefore, when an
individual is isolated it appears that these metrics do correlate
well with the number of program changes.

Graph 2 gives an example plot of a metric (again SC) with
program changes where only one ad hoc individual's project is
considered. Each point represents a segment (procedure or
function) in the final delivered product. It is clear that this
approach yields a much closer relationship between the variables
of interest than the inter-product comparison of Graph 1.

It is somewhat surprising that a linear fit does better than
an exponential fit for almost all cases in terms of both r square
and the distribution of the residuals. Many have argued that
segments should be made very small to control their complexity.
An exponential fit would imply that the argument is valid, since
the sum of the complexities of several very small segments would
be much smaller than the complexity of one larger segment with
the same amount of code. However, a linear fit implies that
there is no advantage to splitting a large segment into many
smaller segments unless duplication of code could be reduced.

The 19 projects did demonstrate a linear fit for all five
metrics. Only a couple of isolated fits yielded any improvement
using exponential fits. The straight lines do intersect the axes fairly
close to the origin, so the poor fit of the exponential is not
caused by missing the low valued points due to forcing the curve
through the origin.
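
One way such a comparison between a linear and an exponential model could be carried out is sketched below. It is not a description of the procedure used in the study: the exponential curve is fitted here by regressing the logarithm of the changes on the metric, r square is computed on the original scale as one minus the ratio of residual to total sum of squares, and the data are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_square(ys: Sequence[float], predicted: Sequence[float]) -> float:
    """1 - SSE / SST, measured on the original scale of y."""
    mean_y = sum(ys) / len(ys)
    sse = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst


# Hypothetical segments: metric value and number of program changes.
xs: List[float] = [5, 10, 20, 40, 60]
ys: List[float] = [2, 4, 7, 15, 22]

slope, intercept = fit_line(xs, ys)
linear_r2 = r_square(ys, [slope * x + intercept for x in xs])

# Exponential model y = a * exp(g * x), fitted as ln y = ln a + g * x.
g, ln_a = fit_line(xs, [math.log(y) for y in ys])
exp_r2 = r_square(ys, [math.exp(ln_a + g * x) for x in xs])

print(linear_r2, exp_r2)
```

On data like these, where changes grow roughly in proportion to the metric, the straight line explains more of the variation than the exponential curve.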

It is highly probable that the linear model appears to fit
best because the segments are so small (the average "maximum
segment size" for the 19 projects is 66 statements). The
exponential tail might show up if there were larger (more complex)
segments. It is also possible that programmers naturally limit
themselves to smaller segments where they can handle the
complexity level.

More interesting, however, is the fact that the slopes of
the fitted lines varied from .157 to .729 for SC. Similar
variation exists for the slopes of the other metrics. The slope of
the line may, in fact, be viewed as a measure of a programmer's
ability to cope with complexity since it is just the number of
program changes he makes in developing a program for each unit of
complexity. This interpretation is possible because the
intercepts of the regression lines are close to zero. It is this
variation which accounts for the lack of results using several
projects produced by different people.

Experimentation which combines the work of different people
is likely to contain a large amount of noise. Any results which
can be obtained are of course not invalidated by the noise.
However, results may not show statistically even when they exist in
the underlying population. Indeed, this phenomenon alone may be
the cause of many failed experiments.

4.2. Slope Metrics

Using  the  slope  as  an  indication  of  a  programmer's  ability 
to  cope  with  complexity  gives  hope  of  producing  an  experiment 
which  can  quantify  a  programmer's  limitations  with  respect  to 
the  complexity  of  various  applications.  The  results  might  be 
used  for  management  decisions  such  as  assignment  of  tasks. 

The results presented here, however, do not give a total
picture of the individual's ability to cope with complexity. One
complexity metric is not powerful enough to represent the
difficulty of the task.

Since a single complexity metric is unlikely to cover all
aspects of complexity, it may be possible for a programmer to
shift the difficulty of development to unmeasured areas. A
vector of metrics (and corresponding slopes) might give a better
indication of the ability of the programmer to cope with
complexity. In fact, such a vector may be useful in determining how to
allocate the available programmer resources so that each is
working on problems where the complexity is expected to be of the
type that he is most capable of handling.


One advantage of a slope metric is its independence of the
specification (as long as the specification is not changing).
Note that in the case of this experiment, the specification for
each of the segments in a given product is different. It might
therefore be possible to take measurements from the regular work
of the programmers over a long period of time and avoid the need
for a special experiment. The programmers would then not need to
be specifically aware of the experiment, so their performance
would not be affected by any reactions to the experimental
situation.


The benefit of a derived metric like slope might still be
realizable even if the fits are nonlinear. For example, if the
relationship is exponential, the value of the exponent might
provide a measure of the programmer's limitations. The use of
metrics in the evaluation of a programmer's ability to cope with
complexity is an area that warrants considerable research
attention.

4.3. Comparison of Metrics

The  five  members  of  the  family  have  been  compared  to  see  how 
well  they  predict  the  number  of  changes  that  were  made  to  each 
segment.  The  results  are  summarized  in  Table  2. 

For each project, the coefficient of determination was
compared over the five metrics. Friedman's test [Conover] is
employed to determine globally (over all five metrics) whether
there is reason to believe that any of the metrics performs
significantly differently from the others. After concluding that
there is a difference in the metrics at the .02 level of
significance, a two-tailed sign test [Siegel] was used pairwise to test
the null hypothesis that the metrics have equal predictive value.
If the level of significance was less than .20, the alternative
hypothesis (that there is a difference) with the direction of
difference was listed in Table 2. Otherwise, the two metrics are
listed as equal ("="), indicating that we may not reject the null
hypothesis. The last column contains the ratio of the times that
the first listed metric had a better (higher) r square than the
second metric to the total number of data points in the group.

              Metric comparisons
            (using the sign test)

          Friedman yields a .02 level

    "SC = ST"
    "SC > CV"    at .167 level
    "SC > DS"    at .063 level    (14/19)
    "SC > CA"                     (15/19)
    "CV < ST"    at .019 level    ( 4/19)
    "CV = DS"                     ( 7/19)
    "CV = CA"                     (10/19)
    "DS = ST"                     ( 7/19)
    "DS = CA"                     (11/19)
    "CA < ST"    at .063 level    ( 5/19)

                   Table 2

It  should  be  noted  that  significance  levels  of  .2  are  not 
particularly  strong,  and  in  fact  anything  over  .05  is  perhaps 
best  read  as  indicative  of  a  possible  trend  but  inconclusive  in 
the  current  study. 
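
The significance levels attached to the ratios in Table 2 can be recomputed from the binomial distribution. The sketch below is a minimal two-tailed sign test, assuming no tied projects; the function name is hypothetical.

```python
from math import comb


def sign_test_two_tailed(wins: int, n: int) -> float:
    """Two-tailed sign test p-value for `wins` successes in n fair trials."""
    k = max(wins, n - wins)                     # the more extreme count
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Ratios taken from Table 2.
p_4_of_19 = sign_test_two_tailed(4, 19)      # "CV < ST"
p_14_of_19 = sign_test_two_tailed(14, 19)    # "SC > DS"
p_5_of_19 = sign_test_two_tailed(5, 19)      # "CA < ST"
print(p_4_of_19, p_14_of_19, p_5_of_19)
```

A split of 4 against 15 gives a level of about .019, and splits of 14 against 5 or 5 against 14 give about .0636, in line with the .019 and .063 entries reported in the table.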

The  results  show  that  ST  does  better  than  CV  and  CA  in 
explaining  the  number  of  program  changes.  Moreover,  there  is  an 
indication  that  SC  is  better  than  CA,  CV,  or  DS.  There  is  no 
distinguishable  difference  between  SC  and  ST  or  between  CA,  CV, 
and  DS. 

Since  the  statement  count  is  very  easy  to  calculate  and  many 
researchers  have  found  that  it  does  a  credible  job  of  measuring 
the  complexity,  it  must  be  considered  as  the  metric  to  beat  in 
studies  of  this  kind.  This  study  has  failed  to  find  a  metric 
that  is  significantly  better  than  statement  count. 

4.4. Comparison of Methodologies

The five metrics were used to compare the different groups
of teams. This part of the study uses the Kruskal-Wallis test
and the Mann-Whitney U test [Siegel] to determine if a particular
group appears to have a better value than another. The groups
were compared with respect to the values of the slope and the
coefficient of determination. Note that the slope of the line
has units of changes per unit of complexity. Thus the larger the
slope, the more changes made in the face of a given level of
complexity and (supposedly) the worse the methodology or programmer
which produced it. Statistically, the coefficient of
determination is a measure of the amount of variation in the dependent
variable (program changes) that may be explained by the variation
in the independent variable (the product metric). In this case,
we contend that the coefficient of determination is a measure of
the consistency of the group with respect to spreading the
problems evenly through the code (as measured by the complexity
metric). Under the hypothesis that methodology makes a group act
more like an individual in terms of the consistency, one would
expect that the disciplined teams would have a coefficient of
determination which is slightly lower (less consistent) than the
ad hoc individuals but larger (more consistent) than the ad hoc
teams. The results appear in Tables 3 and 4. The CALLS (CA) metric
does not appear in these tables because none of the statistics
are significant with regard to it. Appendix 2 shows the raw data
sorted and displayed to illustrate the contribution of each
group.
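
As an illustration of the pairwise group comparison, the Mann-Whitney U statistic and an exact two-sided significance level can be computed by enumerating every way of splitting the pooled values into two groups. The sketch below assumes small groups (the enumeration grows quickly with group size) and uses hypothetical slope values, not the values from the study.

```python
from itertools import combinations
from typing import Sequence


def u_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Count pairs (x in a, y in b) with x > y; ties count one half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in a for y in b)


def exact_two_sided_p(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact permutation p-value for U, enumerating every regrouping."""
    pooled = list(a) + list(b)
    n1, n2 = len(a), len(b)
    centre = n1 * n2 / 2                        # value of U expected under H0
    observed = abs(u_statistic(a, b) - centre)
    hits = total = 0
    for chosen in combinations(range(len(pooled)), n1):
        picked = set(chosen)
        first = [pooled[i] for i in picked]
        second = [pooled[i] for i in range(len(pooled)) if i not in picked]
        total += 1
        if abs(u_statistic(first, second) - centre) >= observed:
            hits += 1
    return hits / total


# Hypothetical slopes for two groups of projects.
group_a = [0.7, 0.6, 0.5]
group_b = [0.2, 0.3, 0.4, 0.55]
u = u_statistic(group_a, group_b)
p = exact_two_sided_p(group_a, group_b)
print(u, p)
```

A larger U for the first group means its slopes tend to exceed those of the second, which under the reading above would indicate more program changes per unit of complexity.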

The  Kruskal-Wallis  test  yields  a  significance  level  of 
between  .02  and  .05  (depending  upon  the  metric)  in  favor  of  there 
being  some  difference  among  the  slopes  of  the  groups. 

As may be seen in Table 3, the slope of the line is larger
for the ad hoc individuals. This means that the disciplined
teams do a better job (by requiring fewer program changes) for a
given amount of complexity than the ad hoc individuals.

          Methodology Comparisons: slopes
          (using the Mann-Whitney U test)

          AI > DT    at .008 level
          AT = DT

                     Table 3


Furthermore, the ad hoc teams did better at coping with
syntactic complexity than the ad hoc individuals (particularly for
the DS metric), indicating that teamwork, even when
undisciplined, can show some advantages. The disciplined teams do show a
superiority over the ad hoc teams for the CV metric.


For the coefficient of determination, the Kruskal-Wallis
test gives a significance level of .03 to .10 in favor of there
being some difference among the groups. The ad hoc teams seem to
have a lower coefficient of determination than the ad hoc
individuals. It is conjectured that this results from the differing
abilities of the members of the ad hoc teams causing different
parts of the system to be assembled with varying degrees of
effectiveness. It is interesting to note that the disciplined
teams appear to fall somewhere between the other two groups.
This also indicates that a team that works with a set of
methodologies tends to be more consistent than a team that does not.
In fact, the data indicating that the disciplined teams have a
lower coefficient of determination than ad hoc individuals is
very weak, lending more justification to the consistency
hypothesis.

          Methodology Comparisons
          (using the Mann-Whitney U test)

                  Table 4

5. Orthogonality of the Metrics

If the complexity of computer programs is to be measured, it
is necessary to develop metrics which have a degree of
orthogonality, i.e. metrics which measure different aspects of the
complexity. As was seen in the correlation matrix of Table 1,
the metrics considered so far lack this property. One possible
way to gain some orthogonality is to normalize the metrics. For
example, if cyclomatic complexity is normalized with respect to
length (by computing CV/ST) the resulting metric is a measure of
decision density in the code. One might then ask if code with a
high decision density requires more program changes than code
with a low decision density. For our data, the answer is no.
Similar results (or lack thereof) hold for CA and DS normalized
by ST.
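
Whether a normalized metric is less tied to length than the raw metric can be checked by comparing correlation coefficients. The sketch below computes the decision density CV/ST per segment and its Pearson correlation with ST; the helper name and the four data points are hypothetical and say nothing about the study's data.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-segment values.
st = [10, 20, 30, 40]          # statement count
cv = [3, 5, 8, 9]              # cyclomatic complexity
density = [c / s for c, s in zip(cv, st)]   # decision density CV/ST

r_raw = pearson(st, cv)            # raw metric against length
r_density = pearson(st, density)   # normalized metric against length
print(r_raw, r_density)
```

A normalized metric is only a candidate for orthogonality if its correlation with the other metrics is weak; as the hypothetical numbers show, dividing by length does not guarantee that.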


A mild relationship does seem to exist between SC/ST and
program changes. It is conjectured, however, that this is so
because the nesting penalties as defined for SC produce a
multiple of length for nested statements, so that SC/ST is related to
nesting depth which is related to length. The normalized metrics
were also tried in multiple regression equations with all of
the original metrics, using incremental regression techniques
[Neter & Wasserman]. The normalized metrics proved to yield
little additional information in the equations. Hence no truly
orthogonal metrics within the study of syntactic metrics which
help explain program changes have been uncovered.

We believe that orthogonal metrics may exist outside the
realm of syntactic complexity. Metrics that measure other
properties of programs and program development, e.g. data metrics
[Basili & Turner 76; Dunsmore; Elshoff; Stevens, Myers &
Constantine; Weiser], may prove orthogonal to the control structure
metrics studied here. We are currently investigating a variety
of metric classes.

6. Conclusions

A  family  of  syntactic  complexity  metrics  has  been  defined 
which  encompasses  many  of  the  current  metrics.  The  family  has 
been  used  in  comparing  different  individuals,  metrics,  and 
development  methodologies. 

It was found that individuals differ widely in the number of
program changes required to implement a program of some given
complexity. This leads to the possibility of measuring a
programmer's ability to cope with complexity. This ability
measure concept should be pursued with complexity metrics from other
groups of metrics (such as data complexity metrics).

Furthermore, we have some evidence that a team acts more
capably than an individual as measured by the slopes of the
fitted regression lines. This lends support to the argument that
even small projects which one person might be able to do will
nonetheless be done better if more than one person cooperates
in the development (at least when they take active steps, such as
the use of various methodologies, to aid in their communication).

Finally, it was shown that ad hoc teams show far less
consistency in their ability to cope with complexity than
individuals or disciplined teams, giving further support to the claim
that a disciplined team tends to gain the advantages of a team
(lower slope) while maintaining the more consistent properties of
a single individual.